Implement pseudo Huber loss for Flux and SD3 #1808
This is great, I was literally just considering asking for Huber loss to be implemented for kohya Flux somehow. It was one of the biggest steps forward for my SDXL training. I'll try to give it a go soon.
Thank you, this is great. I think it makes sense to split get_timesteps_and_huber_c.
However, regarding the responsibilities of the methods, I personally don't want to pass args to conditional_loss. We may refactor it after merging to call get_timesteps, get_huber_threshold, and conditional_loss in order from each script. We appreciate your understanding.
I also do not like the current approach. There should be a "step context" object that collects information about the training context and gets passed to all the various methods/strategy/utility functions, instead of having to pass around a handful of separate parameters everywhere. But that is a bigger refactoring than I'd like to make here.
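To make the two comments above concrete, here is a minimal sketch of what a "step context" passed through the three calls in order could look like. Everything in it is hypothetical: the StepContext class, its fields, the constant placeholder threshold, and the function signatures are illustrative assumptions, not the repository's actual API.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepContext:
    """Hypothetical container for per-step training state (not in sd-scripts)."""
    min_timestep: int
    max_timestep: int
    batch_size: int
    loss_type: str = "l2"
    timesteps: Optional[List[int]] = None
    huber_threshold: Optional[List[float]] = None


def get_timesteps(ctx: StepContext, rng: random.Random) -> None:
    # One timestep per batch item, sampled from [min_timestep, max_timestep).
    ctx.timesteps = [
        rng.randrange(ctx.min_timestep, ctx.max_timestep)
        for _ in range(ctx.batch_size)
    ]


def get_huber_threshold(ctx: StepContext, constant: float = 0.1) -> None:
    # Placeholder: a constant threshold, set only for Huber-type losses.
    if ctx.loss_type in ("huber", "smooth_l1"):
        ctx.huber_threshold = [constant] * ctx.batch_size


def conditional_loss(ctx: StepContext, errors: List[float]) -> float:
    # Mean loss over the batch; reads everything it needs from the context.
    if ctx.loss_type == "huber":
        per_item = [
            2.0 * c * (math.sqrt(e * e + c * c) - c)
            for e, c in zip(errors, ctx.huber_threshold)
        ]
    else:
        per_item = [e * e for e in errors]
    return sum(per_item) / len(per_item)
```

Each script would then call get_timesteps, get_huber_threshold, and conditional_loss in that order on the same context, so no function needs the full args object.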
@@ -5821,29 +5828,10 @@ def save_sd_model_on_train_end_common(
     huggingface_util.upload(args, out_dir, "/" + model_name, force_sync_upload=True)


-def get_timesteps_and_huber_c(args, min_timestep, max_timestep, noise_scheduler, b_size, device):
+def get_timesteps(min_timestep, max_timestep, b_size, device):
     timesteps = torch.randint(min_timestep, max_timestep, (b_size,), device="cpu")
Would this work under the current implementation of SD3 in this repo? I haven't looked for a while to see if it was updated.
But as far as I can recall, the scheduling was implemented in this repo, in a different way, and didn't have the 'add_noise' method like other implementations from diffusers.
So essentially, it was randomizing the index for the timestep, but then taking the timestep itself from the noise_scheduler and calculating the noise using sigmas.
Maybe this has changed; I haven't looked in the repo. But it's worth a double check.
If that is an issue, then it is separate from the scope of this PR.
add_noise can be found here and is not significantly different from other repositories.
sd-scripts/library/train_util.py, line 5867 in 740ec1d:
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
Only SD3/3.5 (sd3_train.py or sd3_train_network.py) doesn't use get_noise_noisy_latents_and_timesteps().
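For readers following this thread, the index-based sampling described in the question can be sketched roughly as below. This is an illustration under stated assumptions, not the repository's SD3 code: the linear sigma table and the helper names make_sigmas and noisy_sample are hypothetical, and a real scheduler may apply a shift to the sigmas.

```python
import random


def make_sigmas(num_train_timesteps: int = 1000):
    # Assumed linear table: sigma runs from 1.0 down to 1/num_train_timesteps.
    sigmas = [
        (num_train_timesteps - i) / num_train_timesteps
        for i in range(num_train_timesteps)
    ]
    # The timestep fed to the model is derived from the same table entry.
    timesteps = [s * num_train_timesteps for s in sigmas]
    return sigmas, timesteps


def noisy_sample(latents, noise, sigma):
    # Flow-matching style interpolation between clean latents and noise.
    return [(1.0 - sigma) * x + sigma * n for x, n in zip(latents, noise)]


sigmas, timesteps = make_sigmas()
rng = random.Random(0)
idx = rng.randrange(len(sigmas))   # randomize the index, not the timestep
t = timesteps[idx]                 # timestep looked up from the table
noisy = noisy_sample([0.5, -0.5], [1.0, 1.0], sigmas[idx])
```

The point of the sketch is only that a sampled timestep value is still available per batch item in this scheme, which is what a timestep-dependent Huber threshold would need.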
I gave it a try... I'm not seeing good results so far. :( I went for:
Learning with Huber loss enabled seemed a lot slower for the same LR, and I didn't see any big image quality improvements either. I pushed up the LR, but that just got me the usual grid-shaped noise pattern. I was using a gradient accumulation size of 16. Maybe it's dataset-specific, and it'll work better for someone else? My training dataset is around 200 images in size, mostly high quality, but with some poorer quality ones mixed in, so I hoped Huber would help with that.
I've had to increase the LR as well in my tests, typically by around 20-50%, depending on [...] I am not sure about the best [...] If the learning is not progressing as fast as before, I recommend increasing [...]
Also, from past experience with this type of loss function in other models, I don't expect this to improve image detail quality much; it usually helps more with getting consistency in the outputs.
I have some better news to report now. Without changing anything from the settings I gave above, I've let the training continue longer. And the image quality has been rising steadily for a while now, and I'm getting some nice output. Maybe Huber just has different learning characteristics.
I usually observe this kind of thing when pushing LoRA strength way above 1.0 during inference, or when applying multiple LoRAs simultaneously without lowering the strength of each. I suspect this is related to having very large weight magnitude due to stacking, or over-training. Training at 5e-4 seems super high to me; I usually don't go above 2e-4. What are your [...]
I usually have a lower LR, but once the gradient accumulation value goes up, I find I usually have to push the LR higher, as I think the gradient update is divided by the number of accumulations, unlike batch size.

And higher LRs actually work amazingly well with Flux. The images I've been getting have rich colors and interesting camera angles (i.e. the Flux DPO seems to still be active), but also include the trained objects. They also have rich backgrounds with many scene-appropriate items appearing, a sign of network health. The pictures I get out from flux_minimal_inference.py (which is what I use for sample images) are honestly the kind of thing I'd put up on my art page, if it wasn't for the grid artifact appearing on them.

Lower LRs haven't worked so well, as the trained objects don't get learned, and my images get boring colors and camera angles, showing the DPO has been lost. I get the feeling local minima are a real problem with image training, and the higher LRs can power through those.

I really should get tensorboard up and running with my runs. I do wonder if maybe what we need for Flux is the ability to set the LR individually for each of the 19 double and the 38 single blocks. Maybe the average_key_norm and max_key_norm sizes would guide me towards which ones.

Edit: I'm trying a 1e-4 run now, still with grad_accum=16. Maybe this LR will work well with Huber loss.
I've been told that high gradient accumulation (with 16-bit floats) can cause numerical instability and/or loss of precision; not sure if that applies here. From tests I've made, usually these types of visual artifacts tend to appear when [...] I suspect it would be even more effective to set distinct [...]
Okay, so I have some more good news: my training run at 1e-4 with @recris's Huber loss went very well. Now I have some very high quality images, with the concept in the training images learned close to perfectly. (It's still training.)

The big thing to note is how long it took. Training with gradient accumulation 16 at 1e-4 took pretty much a whole day of training to get it to work. But it did make continuous steady progress throughout this time. Without Huber loss, I either found the learning of the training concept would get stuck and sample images would continually mis-draw the objects being learned, or that the sample images would lose their high image quality and draw boring scenes. That doesn't seem to be the case with Huber loss.

@recris, I've heard things about high gradient accumulation being bad too, but it doesn't seem to have affected me here. Before Huber loss was implemented, I pushed up from accumulating 8 to accumulating 16 to try to reach better image quality, and that did seem to work. But yeah, gradient accumulation feels a bit suspect in general compared to increasing batch size. It does use less memory though, and another advantage of gradient accumulation that people often miss is that batch size only works on images that are in the same bucket for image dimensions. And often buckets only have single images in them, silently preventing batch size from working for them. Gradient accumulation is immune to that.

Anyway, I'm glad that this PR has made it into the SD3 branch, as it seems worthwhile to me.
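As a side note on the batch size versus gradient accumulation comparison in this thread, here is a small self-contained sketch on a toy one-parameter model. It assumes a mean-reduced loss and that each micro-batch gradient is divided by the number of accumulation steps, which is a common convention but is not confirmed here for sd-scripts. The function names are made up for the example.

```python
def grad_mse(w, xs, ys):
    # Gradient of mean((w * x - y) ** 2) with respect to w.
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accum_steps):
    # Split the data into equal micro-batches (len(xs) must be divisible
    # by accum_steps) and sum their gradients, each scaled by 1/accum_steps.
    micro = len(xs) // accum_steps
    total = 0.0
    for i in range(accum_steps):
        bx = xs[i * micro:(i + 1) * micro]
        by = ys[i * micro:(i + 1) * micro]
        total += grad_mse(w, bx, by) / accum_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.5]
full_batch = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, accum_steps=4)
```

Under those assumptions the two gradients agree up to floating-point rounding, so in exact arithmetic accumulation behaves like a larger batch. Precision effects with 16-bit floats, which were raised earlier in the thread, are outside what this toy example shows.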
All insights about how to use it are great, but having them in this PR makes it very hard to find and use them in the future.
@bmaltais can we have this? I would like to test this, ty
Hey there @recris, I gave this a try (with 4.0), and yeah, that lets me get to the higher LRs without those vertical lines, with the target object being learned at higher quality. Great suggestion. I hope we can get individual LRs for individual blocks at some point, like we had for SDXL with [...]
Okay, I found something I think is pretty cool, and I thought I'd put it here cause it relates to what's been said so far. I found that using [...] I tried pushing up the value for [...] I suspected that the issue was that the later single blocks were getting overtrained, and they can't deal with the higher key lengths - but the earlier double blocks were not overtrained and needed the longer key lengths available to them to be able to learn the trained object. So I added this code to [...]
and changed the next couple of lines to use the new [...]
This seemed to immediately get rid of the contrasty look, and let me increase my value for [...] Maybe we need a [...]
Like you said before, we need a way to control the learning rate at block level. I find it too easy to create a burned LoRA before it properly learns the target concept(s). Alternatively, there is a chance this could be mitigated with a better loss function - maybe the work in #294 should be revisited and translated to Flux training.
Hey, thanks for the pointer, @recris. I'm actually learning a complex object rather than a style, but maybe I'll want a style later. I found another advance to learning my object well that doesn't involve pushing up the key size further. I've discovered that instead of just setting my LoRA's alpha to be 0.5x the rank value, much higher alphas can work very well. Now that I've got it at around 1.67x the rank value, I'm getting great results.
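For context on the alpha change above: in the standard LoRA formulation the low-rank update is multiplied by alpha / rank, so the alpha-to-rank ratio acts as a direct multiplier on the update. The short sketch below just does that arithmetic. The rank of 16 is an arbitrary example value, and the helper name lora_scale is made up for illustration.

```python
def lora_scale(alpha: float, rank: int) -> float:
    # Standard LoRA scaling: the low-rank update is multiplied by alpha / rank.
    return alpha / rank


rank = 16                                  # example value, not from the thread
low = lora_scale(0.5 * rank, rank)         # alpha at 0.5x rank
high = lora_scale(1.67 * rank, rank)       # alpha at about 1.67x rank
ratio = high / low                         # how much larger the multiplier is
```

Going from 0.5x to 1.67x the rank raises that multiplier by a factor of about 3.3, which may be part of why the change interacts with the learning rate settings discussed earlier.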
After playing a bit more with the parameters, I think some of the image artifacts from training with higher learning rates can be mitigated with a more aggressive decay in the exponential schedule. Something like [...] After training with these, I don't get as many "white band" artifacts on the image when I scale the LoRA strength past 100%, or use multiple LoRAs simultaneously. YMMV, this is from limited tests of mine.
This change set implements pseudo Huber loss functionality that was missing from Flux.1 and SD3 model training.
This also introduces a new parameter huber_scale for controlling the base threshold of the Huber function. The previous logic had a fixed threshold that worked very poorly with Flux.1 models; this is something that needs to be tuned to the loss "profile" of each model (mean abs latent error, variance, etc.). The existing huber_c could only be used for adjusting the decay of the exponential schedule, so a new parameter was required.
Refactoring notes:
conditional_loss function, huber_c is no longer passed through the rest of the code; most of the file changes were due to this.
Tested with:
For use with Flux.1 I recommend using huber_scale = 1.8 (or a bit higher) and huber_c = 0.25. YMMV.
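To illustrate how the two parameters described above could interact, here is a minimal pure-Python sketch. It assumes the exponential schedule decays the threshold from huber_scale at timestep 0 to huber_scale * huber_c at the final timestep, and it uses one common form of the pseudo Huber loss. Both the exact formulas and the function names are assumptions for illustration and may differ from the code in this PR.

```python
import math


def huber_threshold(timestep, num_train_timesteps=1000,
                    huber_c=0.25, huber_scale=1.8):
    # Assumed exponential schedule: huber_c sets the decay rate,
    # huber_scale sets the base threshold at timestep 0.
    alpha = -math.log(huber_c) / num_train_timesteps
    return huber_scale * math.exp(-alpha * timestep)


def pseudo_huber_loss(error, threshold):
    # Close to error**2 when |error| is much smaller than the threshold,
    # and close to 2 * threshold * |error| when it is much larger.
    return 2.0 * threshold * (math.sqrt(error ** 2 + threshold ** 2) - threshold)


# Threshold at the start, middle, and end of the schedule.
thresholds = [huber_threshold(t) for t in (0, 500, 1000)]
# A large error is penalized roughly linearly instead of quadratically.
outlier_loss = pseudo_huber_loss(5.0, huber_threshold(500))
```

Under this form, a lower huber_c gives a more aggressive decay, while huber_scale shifts the whole curve up or down to match the typical error magnitude of the model.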